Lab book: CoV data summary

Liam Brierley (University of Liverpool)
2020-04-16

Numbers of viruses, sequences

Search terms: (spike[Title] OR “S gene”[Title] OR “S protein”[Title] OR “S glycoprotein”[Title] OR “S1 gene”[Title] OR “S1 protein”[Title] OR “S1 glycoprotein”[Title] OR peplomer[Title] OR peplomeric[Title] OR peplomers[All Title] OR “complete genome”[Title]) NOT (patent[Title] OR vaccine OR artificial OR construct OR recombinant[Title])

                         host recognised no hosts
no spike sequence                     64      948
spike sequence available              53      520
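The margins of this table can be checked with a few lines of Python. This is a minimal sketch that assumes the table is a 2 x 2 cross-tabulation of spike availability against host status; the dictionary layout and the `margin` helper are illustrative, not part of the original analysis.

```python
# Counts copied from the table above, read as a 2 x 2 cross-tabulation
# (an assumption about the layout): (spike status, host status) -> coronaviruses.
counts = {
    ("no spike sequence", "host recognised"): 64,
    ("no spike sequence", "no hosts"): 948,
    ("spike sequence available", "host recognised"): 53,
    ("spike sequence available", "no hosts"): 520,
}


def margin(table, axis, label):
    """Sum the cells whose key matches `label` on one axis (0 = spike, 1 = host)."""
    return sum(n for key, n in table.items() if key[axis] == label)


total = sum(counts.values())
with_spike = margin(counts, 0, "spike sequence available")
with_host = margin(counts, 1, "host recognised")

print(f"coronaviruses in total: {total}")
print(f"with spike sequence: {with_spike} ({with_spike / total:.1%})")
print(f"with a recognised host: {with_host} ({with_host / total:.1%})")
```

The spike-available margin (53 + 520) gives the 573 coronaviruses considered under "Numbers of hosts".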
Number of sequences per coronavirus is heavily skewed; most have just 1:
n_seqs Freq
1 457
2 49
3 7
4 5
5 1
6 5
7 1
8 2
10 2
11 2
12 2
13 2
14 1
15 1
16 2
17 2
19 2
23 1
26 2
27 1
31 2
32 1
33 1
35 1
43 1
51 1
54 1
60 1
66 1
71 1
75 1
86 1
150 1
151 1
172 1
173 1
361 1
393 1
590 1
660 1
739 1
753 1
991 1
3450 1
5915 1
     childtaxa_name                                       n_seqs
1581 Human coronavirus OC43                                  739
1582 Feline coronavirus                                      753
1583 Middle East respiratory syndrome-related coronavirus    991
1584 Porcine epidemic diarrhea virus                        3450
1585 Infectious bronchitis virus                            5915

SARS-CoV-2 sequences

Numbers of hosts

Considering only the 573 coronaviruses with available spike protein sequence data…

Number of host species per coronavirus is also heavily skewed, as expected:
Hostspp Freq
0 520
1 37
2 3
3 4
5 2
6 1
8 1
15 1
18 1
26 1
38 1
48 1
    childtaxa_name                                        Hostspp
569 Severe acute respiratory syndrome-related coronavirus      15
570 Alphacoronavirus 1                                         18
571 Betacoronavirus 1                                          26
572 Bat coronavirus                                            38
573 Avian coronavirus                                          48

The coronaviruses with the broadest host ranges are very broadly defined species that encompass many individual strains.

Number of coronaviruses per host species is also heavily skewed:
CoVs Freq
0 1263
1 124
2 20
3 4
4 2
5 2
7 1
23 1
     host                Freq
1412 mustela putorius       4
1413 rattus norvegicus      4
1414 sus scrofa             5
1415 vicugna pacos          5
1416 homo sapiens           7
1417 rhinolophus sinicus   23

As expected, the hosts with the most coronaviruses are commonly studied species (ferret, rat, domestic pig, human), plus livestock (alpaca) and one horseshoe bat (Rhinolophus sinicus), for which sequences mostly derive from a single study.

Number of coronaviruses infecting each host group:

Host groups are mutually exclusive, i.e. primates = non-human primates. Other mammals = misc orders (Proboscidea, Eulipotyphla, Cingulata, etc.)

Not too sure this is very meaningful given how little we know about potential animal hosts of coronaviruses

Sequence data quality

Var1                Freq
complete_spike 0.1495634
partial_spike  0.6674924
whole_genome   0.1829442

      complete_spike partial_spike whole_genome
other           1375           430        23099
S               2316          5443         2858
S1                15          5143            0
S2                11            47            0

Excluding partial sequences, number of complete spike protein (S) sequences per coronavirus (taxid), pooling the complete_spike and whole_genome sequence types. Top row is the number of sequences, bottom row the number of coronaviruses with that many:

n_seqs    0    1    2    3    4    5    6    7    8   10   11   13   14   16   17   19   21   26   27   28   31   33   42   43   44   46   47   53   87  142  164  213  282  484  709 1824
Freq      8  315   46    7    5    4    4    1    2    2    1    3    2    1    2    1    1    2    1    1    1    2    1    1    1    1    1    1    2    1    1    1    1    1    1    1

So 5174 individual spike protein sequences across 419 viruses.
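The two totals on the line above can be re-derived from the frequency table. This is a minimal sketch with the values copied by hand from the table; the variable names are illustrative and not taken from the original analysis.

```python
# Number of complete spike sequences per coronavirus (top row of the table)...
n_seqs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 13, 14, 16, 17, 19, 21, 26,
          27, 28, 31, 33, 42, 43, 44, 46, 47, 53, 87, 142, 164, 213, 282,
          484, 709, 1824]
# ...and the number of coronaviruses with that many sequences (bottom row).
n_viruses = [8, 315, 46, 7, 5, 4, 4, 1, 2, 2, 1, 3, 2, 1, 2, 1, 1, 2,
             1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1,
             1, 1]
assert len(n_seqs) == len(n_viruses)

# Total sequences: each count weighted by how many viruses have it.
total_sequences = sum(k * f for k, f in zip(n_seqs, n_viruses))
# Viruses with at least one complete spike sequence (drop the zero column).
viruses_with_spike = sum(f for k, f in zip(n_seqs, n_viruses) if k > 0)

# Cross-check against the S row of the sequence-type table:
# complete_spike records plus whole_genome records.
complete_s_records = 2316 + 2858

print(total_sequences, viruses_with_spike, complete_s_records)
```

The weighted sum agrees with the S row of the sequence-type table (2316 complete_spike plus 2858 whole_genome records), which is what suggests both sequence types are pooled here.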

NB: further complete coding sequences are available that give the subunits separately: S1 for 1 virus (Infectious bronchitis virus) and S2 for 2 viruses (Infectious bronchitis virus, Porcine epidemic diarrhea virus). Not considering these for now.

Sequence data summaries

Excluding partial sequences, summaries of genomic characteristics per coronavirus (taxid) (i.e. values are averaged within each virus so that each virus represents only one data point):

Only for viruses that have whole genome sequences, mean lengths:

Lengths, ENC, GC content between-coronaviruses in spike and other proteins; within-coronaviruses in spike:


Analysis of Variance Table

Response: length
            Df    Sum Sq Mean Sq F value    Pr(>F)    
taxid      418 347888600  832269   124.1 < 2.2e-16 ***
Residuals 4755  31889888    6707                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Fairly consistent sequence lengths of spikes compared to other proteins (expected as pooling all others). Very little within-coronavirus variation.


Analysis of Variance Table

Response: enc
            Df Sum Sq Mean Sq F value    Pr(>F)    
taxid      418  47569 113.801  298.29 < 2.2e-16 ***
Residuals 4755   1814   0.382                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Stronger average codon bias in spike than other proteins! Reasonable variation in spike codon biases between-coronaviruses and within some coronaviruses. Human CoV HKU1 more strongly biased than other CoVs.


Analysis of Variance Table

Response: G + C
            Df    Sum Sq Mean Sq F value    Pr(>F)    
taxid      418 134609240  322032  241.42 < 2.2e-16 ***
Residuals 4755   6342721    1334                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

GC content slightly lower and slightly more uniform in spikes than in other proteins! Some variation between-coronaviruses, and some variation within-coronaviruses. Human CoV HKU1 and Wencheng shrew CoV more strongly biased than other CoVs.
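One way to put a number on between- versus within-coronavirus variation is the share of the total sum of squares attributed to taxid (eta-squared) in each of the three ANOVA tables above. This is an extra summary, not something computed in the original analysis. The sketch below also recomputes each F value from the reported Df and Sum Sq as a sanity check.

```python
# (Df, Sum Sq) for the taxid and Residuals rows, copied from the tables above.
anova = {
    "length": {"taxid": (418, 347888600), "residuals": (4755, 31889888)},
    "enc": {"taxid": (418, 47569), "residuals": (4755, 1814)},
    "G + C": {"taxid": (418, 134609240), "residuals": (4755, 6342721)},
}


def summarise(rows):
    """Return (F value, eta-squared) for a one-way ANOVA table."""
    df_between, ss_between = rows["taxid"]
    df_within, ss_within = rows["residuals"]
    f_value = (ss_between / df_between) / (ss_within / df_within)
    eta_squared = ss_between / (ss_between + ss_within)
    return f_value, eta_squared


# Degrees of freedom are consistent with 419 viruses and 5174 sequences.
assert anova["length"]["taxid"][0] == 419 - 1
assert anova["length"]["residuals"][0] == 5174 - 419

results = {name: summarise(rows) for name, rows in anova.items()}
for name, (f_value, eta_squared) in results.items():
    print(f"{name}: F = {f_value:.1f}, between-virus share = {eta_squared:.3f}")
```

For all three responses the between-virus share comes out above 0.9, so most of the variation in spike length, ENC and G + C lies between coronaviruses rather than within them.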

Mean GC content of spike versus known host range count, labelled as human/nonhuman virus

Not too informative though useful to see which virus is which?

Spike protein composition

Dinucleotide biases do vary in scale - clearly some biases present (TG overrepresented, CG underrepresented). But these are pretty consistent between genera.

Reassuring - biases are more extreme at bridge (3-1) dinucleotides as expected. TG, TA, CA overrepresented, GT, GA, AT underrepresented. Still sufficient variability to look for signal in.
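For reference, a dinucleotide bias of this kind can be sketched as an observed/expected ratio, computed either over all adjacent pairs or only over bridge (3-1) pairs spanning two codons. This is a minimal illustration, not necessarily the calculation behind the figures here: the function name, the choice of expected frequency (product of first-base and second-base frequencies among the pairs considered) and the toy sequence are all assumptions.

```python
from collections import Counter
from itertools import product


def dinucleotide_bias(cds, bridge_only=False):
    """Observed/expected ratio per dinucleotide in an in-frame coding sequence.

    Expected frequency is the product of the first-base and second-base
    frequencies among the pairs considered (an assumption; other
    normalisations exist). Ratios above 1 indicate overrepresentation.
    """
    cds = cds.upper().replace("U", "T")
    if bridge_only:
        # Bridge (3-1) pairs: codon position 3 plus position 1 of the next codon.
        starts = range(2, len(cds) - 1, 3)
    else:
        starts = range(len(cds) - 1)
    pairs = [cds[i:i + 2] for i in starts]
    pairs = [p for p in pairs if set(p) <= set("ACGT")]  # drop ambiguous bases
    n = len(pairs)
    if n == 0:
        return {}
    observed = Counter(pairs)
    first = Counter(p[0] for p in pairs)
    second = Counter(p[1] for p in pairs)
    bias = {}
    for x, y in product("ACGT", repeat=2):
        expected = (first[x] / n) * (second[y] / n)
        if expected > 0:  # skip pairs that cannot occur given base content
            bias[x + y] = (observed[x + y] / n) / expected
    return bias


toy_cds = "ATGTGTTGCTGATGA"  # hypothetical 5-codon sequence, not real spike data
print(dinucleotide_bias(toy_cds))
print(dinucleotide_bias(toy_cds, bridge_only=True))
```

Averaging these ratios within each virus before comparing genera would keep each virus as a single data point, matching the per-taxid summaries used elsewhere in these notes.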

Most obvious thing is use of different stop codons. But otherwise, fairly consistent across genera again.
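Stop codon usage per genus can be tallied directly from the last codon of each coding sequence. The sketch below is illustrative only: the function name, the input layout of (genus, sequence) pairs and the toy sequences are hypothetical, and it assumes each sequence is in frame and includes its terminal codon.

```python
from collections import Counter, defaultdict

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def stop_codon_usage(records):
    """Tally the terminal codon of each coding sequence by group.

    `records` is an iterable of (group, cds) pairs. Sequences that do not
    end in a stop codon are tallied under "none".
    """
    usage = defaultdict(Counter)
    for group, cds in records:
        last = cds.upper().replace("U", "T")[-3:]
        usage[group][last if last in STOP_CODONS else "none"] += 1
    return {group: dict(tally) for group, tally in usage.items()}


# Hypothetical toy sequences, for illustration only.
toy_records = [
    ("Alphacoronavirus", "ATGGCTTAA"),
    ("Alphacoronavirus", "ATGGCATGA"),
    ("Betacoronavirus", "ATGGCTTAA"),
    ("Gammacoronavirus", "ATGGCT"),
]
usage = stop_codon_usage(toy_records)
print(usage)
```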

Not convinced amino acid bias is really useful here - it’s just the proportion of each amino acid in the protein sequence, and it’ll be fairly consistent between CoVs.